Linear principal component analysis (PCA) can be extended to a nonlinear
PCA by using artificial neural networks. But the benefit of curved components
requires a careful control of the model complexity. Moreover, standard
techniques for model selection, including cross-validation and more generally
the use of an independent test set, fail when applied to nonlinear PCA because
of its inherent unsupervised characteristics.

This paper presents a new approach for validating the complexity of nonlinear
PCA models by using the error in missing data estimation as a criterion
for model selection. It is motivated by the idea that only the model of optimal
complexity is able to predict missing values with the highest accuracy.
While standard test set validation usually favours over-fitted nonlinear PCA
models, the proposed model validation approach correctly selects the optimal
model complexity.

1 Introduction

Nonlinear principal component analysis [1, 2, 3] is a nonlinear generalization of standard 
principal component analysis (PCA). While PCA is restricted to linear components, 
nonlinear PCA generalizes the principal components from straight lines to curves and 
hence describes the inherent structure of the data by curved subspaces. Detecting and 
describing nonlinear structures is especially important for analysing time series. Non- 
linear PCA is therefore frequently used to investigate the dynamics of different natural 
processes [4, 5, 6]. But validating the model complexity of nonlinear PCA is a difficult 
task [7]. Over-fitting can be caused by the often limited number of available samples; 
moreover, in nonlinear PCA over-fitting can also occur by the intrinsic geometry of the 
data, as shown in Fig. 5, which cannot be solved by increasing the number of samples. A 
good control of the complexity of the nonlinear PCA model is required. We have to find 
the optimal flexibility of the curved components. A component with too little flexibility, 
an almost linear component, cannot follow the complex curved trajectory of real data. 
By contrast, a too flexible component fits non-relevant noise of the data (over-fitting) 
and hence gives a poor approximation of the original process, as illustrated in Fig. 1A.
The objective is to find a model whose complexity is neither too small nor too large.
Even though the term nonlinear PCA (NLPCA) often refers to the auto-associative
neural network approach, there are many other methods which visualise data and ex- 
tract components in a nonlinear manner [8]. Locally linear embedding (LLE) [9, 10] and 
Isomap [11] visualise high dimensional data by projecting (embedding) them into a two 
or three-dimensional space. Principal curves [12] and self organising maps (SOM) [13] 
describe data by nonlinear curves and nonlinear planes up to two dimensions. Kernel 
PCA [14] as a kernel approach can be used to visualise data and for noise reduction [15]. 
In [16] linear subspaces of PCA are replaced by manifolds and in [17] a neural network 
approach is used for nonlinear mapping. This work is focused on the auto-associative 
neural network approach to nonlinear PCA and its model validation problem. 
For supervised methods, a standard validation technique is cross-validation. But even 
though the neural network architecture used is supervised, the nonlinear PCA itself is 
an unsupervised method that requires validating techniques different from those used 
for supervised methods. A common approach for validating unsupervised methods is to 
validate the robustness of the components under moderate modifications of the original 
data set, e.g., by using resampling bootstrap [18] or by corrupting the data with a small 
amount of Gaussian noise [19]. In both techniques, the motivation is that reliable com- 
ponents should be robust and stable against small random modification of the data. In 
principle, these techniques could be adapted to nonlinear methods. But there would be 
the difficulty of measuring the robustness of nonlinear components. Robustness of linear 
components is measured by comparing their directions under slightly different conditions 
(resampled data sets or different noise-injections). But since comparing the curvature of 
nonlinear components is no trivial task, nonlinear methods require other techniques for 
model validation. 

In a similar neural network based nonlinear PCA model, termed nonlinear factor analysis 
(NFA) [20], a Bayesian framework is used in which the weights and inputs are described 
by posterior probability distributions which leads to a good regularisation. While in 
such Bayesian learning the inputs (components) are explicitly modelled by Gaussian 
distributions, the maximum likelihood approach in this work attempts to find a single 
set of values for the network weights and inputs. A weight-decay regulariser is used 
to control the model complexity. There are several approaches to model selection in
auto-associative nonlinear PCA. Some are based on a criterion of how well the
local neighbour relation is preserved by the nonlinear PCA transformation [21]. In [22],
a nearest neighbour inconsistency term that penalises complex models is added to the 
error function, but standard test set validation is used for model pre-selection. In [23] 
an alternative network architecture is proposed to solve the problems of over-fitting and 
non-uniqueness of nonlinear PCA solutions. Here we consider a natural approach that 
validates the model by its own ability to estimate missing data. Such missing data 
validation is used, e.g., for validating linear PCA models [24], and for comparing proba- 
bilistic nonlinear PCA models based on Gaussian processes [25]. Here, the missing data 
validation approach is adapted to validate the auto-associative neural network based 
nonlinear PCA. 

Figure 1: The problem of mis-validation of nonlinear PCA by using a test data set.
Ten training samples were generated from a quadratic function (dotted line) plus noise. 
(A) A nonlinear PCA model of too high complexity leads to an overfitted component 
(solid curve). But validating this over-fitted model with an independent test data 
set (B) gives a better (smaller) test error than using the original model from which the 
data were generated (C). 

2 The test set validation problem 

To validate supervised methods, the standard approach is to use an independent test 
set for controlling the complexity of the model. This can be done either by using a new 
data set, or when the number of samples is limited, by performing cross-validation by 
repeatedly splitting the original data into a training and test set. The idea is that only
the model that best represents the underlying process can provide optimal results
on new data that were previously unknown to the model. But test set validation only works
well when there exists a clear target value (e.g., class labels), as in supervised methods;
it fails on unsupervised methods. In the same way that a test data set cannot be used
to validate the optimal number of components in standard linear PCA, test data also 
cannot be used to validate the curvature of components in nonlinear PCA [7]. Even 
though nonlinear PCA can be performed by using a supervised neural network archi- 
tecture, it is still an unsupervised method and hence should not be validated by using 
cross-validation. With increasing complexity, nonlinear PCA is able to provide a curved
component with better data space coverage. Thus, test data can also be projected onto
the (over-fitted) curve with a smaller distance and hence give a misleadingly small error.
This effect is illustrated in Fig. 1 using 10 training and 200 test samples generated from a
quadratic function plus Gaussian noise of standard deviation $\sigma = 0.4$. The mean square
error (MSE) is given by the mean of the squared distances $E = \|\hat{x} - x\|^2$ between the
data points $x$ and their projections $\hat{x}$ onto the curve. The over-fitted and the well-fitted
or ideal model are compared by using the same test data set. It turns out that the
test error of the true original model (Fig. 1C) is almost three times larger than the test
error of the overly complex model (Fig. 1B), which over-fits the data. Test set validation
clearly favours the over-fitted model over the correct model, and hence fails to validate
nonlinear PCA.
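
The projection error used above can be sketched in a few lines. This is a minimal illustration, not the setup behind Fig. 1: the sampling range, the exact quadratic $x_2 = x_1^2$, and the zig-zag curve standing in for an over-fitted component are all assumptions made for this sketch, and `projection_mse` is a hypothetical helper name.

```python
import numpy as np

def projection_mse(points, curve):
    """Mean squared distance from each point to its nearest point on a sampled curve."""
    # Pairwise squared distances, shape (n_points, n_curve_samples)
    d2 = ((points[:, None, :] - curve[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

rng = np.random.default_rng(0)
t = np.linspace(-1.0, 1.0, 2001)

# Assumed generating function: a quadratic curve sampled densely
true_curve = np.column_stack([t, t ** 2])

# 200 test samples: points on the true curve plus Gaussian noise (sigma = 0.4)
t_test = rng.uniform(-1.0, 1.0, size=200)
test = np.column_stack([t_test, t_test ** 2]) + rng.normal(0.0, 0.4, size=(200, 2))

# Hypothetical stand-in for an over-fitted component: a curve that
# zig-zags densely across the region occupied by the data
wiggly = np.column_stack([t, t ** 2 + 0.8 * np.sin(25 * t)])

mse_true = projection_mse(test, true_curve)
mse_wiggly = projection_mse(test, wiggly)
print(f"true curve: {mse_true:.3f}, zig-zag curve: {mse_wiggly:.3f}")
```

Because the zig-zag curve passes close to almost every location in the noisy region, nearly any test point finds a short projection onto it, which is the effect described in this section.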

Figure 2: The standard auto-associative neural network for nonlinear PCA. The network
output $\hat{x}$ is required to approximate the input $x$. Illustrated is a 3-4-1-4-3 network
architecture. Three-dimensional samples $x$ are compressed to one component value $z$ in
the middle by the extraction part. The inverse generation part reconstructs $\hat{x}$ from $z$.
The sample $\hat{x}$ is a noise-reduced representation of $x$, located on the component curve.

To understand this contradiction, we have to distinguish between an error in supervised
learning and the fulfilment of specific criteria in unsupervised learning. Test set
validation works well for supervised methods because we measure the error as the differ- 
ence from a known target (e.g., class labels). Since in unsupervised methods the target 
(e.g., the correct component) is unknown, we optimize a specific criterion. In nonlinear 
PCA the criterion is to project the data by the shortest distance onto a curve. But 
a more complex over-fitted curve covers more data space and hence can also achieve a 
smaller error on test data than the true original curve. 

3 The nonlinear PCA model 

Nonlinear PCA (NLPCA) can be performed by using a multi-layer perceptron (MLP) of 
an auto-associative topology, also known as auto-encoder, replicator network, bottleneck, 
or sand-glass type network, see Fig. 2. 

The auto-associative network performs an identity mapping. The output $\hat{x}$ is forced to
approximate the input $x$ by minimising the squared reconstruction error $E = \|\hat{x} - x\|^2$.
The network can be considered as consisting of two parts: the first part represents
the extraction function $\Phi_{extr}: X \to Z$, whereas the second part represents the inverse
function, the generation or reconstruction function $\Phi_{gen}: Z \to X$. A hidden layer in each
part enables the network to perform nonlinear mapping functions. By using additional
units in the component layer in the middle, the network can be extended to extract
more than one component. Ordered components can be achieved by using a hierarchical
nonlinear PCA [26].
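
As a rough illustration of the 3-4-1-4-3 topology, the forward pass can be sketched as follows. The tanh hidden units, the linear component and output layers, and the random initialisation are assumptions for this sketch (the text does not fix them), and the function names are hypothetical.

```python
import numpy as np

def init_autoencoder(layers=(3, 4, 1, 4, 3), scale=0.1, seed=0):
    """Random weights and zero biases for each pair of consecutive layers."""
    rng = np.random.default_rng(seed)
    return [(scale * rng.standard_normal((n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(layers[:-1], layers[1:])]

def forward(params, x):
    """Return (z, x_hat) for a batch x of shape (n_samples, 3)."""
    h = x
    activations = []
    for k, (W, b) in enumerate(params):
        h = h @ W.T + b
        # Assumed: tanh on the two hidden layers (k = 0 and k = 2),
        # linear component layer (k = 1) and linear output layer (k = 3)
        if k in (0, 2):
            h = np.tanh(h)
        activations.append(h)
    z, x_hat = activations[1], activations[3]
    return z, x_hat

def reconstruction_error(params, x):
    """Mean over samples of the squared reconstruction error ||x_hat - x||^2."""
    _, x_hat = forward(params, x)
    return np.mean(np.sum((x_hat - x) ** 2, axis=1))
```

The first two weight layers play the role of the extraction function and the last two the role of the generation function; training would adjust all weights to minimise `reconstruction_error`.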

For the proposed validation approach, we have to adapt nonlinear PCA to be able to 
estimate missing data. This can be done by using an inverse nonlinear PCA model [27] 
which optimises the generation function by using only the second part of the auto-associative
neural network. Since the extraction mapping $X \to Z$ is lost, we have to
estimate both the weights $w$ and the inputs $z$, which represent the values of the
nonlinear component. Both $w$ and $z$ can be optimised simultaneously to minimise the
reconstruction error, as shown in [27].

The complexity of a model can be controlled by a weight-decay penalty term [28] added
to the error function, $E_{total} = E + \nu \sum_i w_i^2$, where $w_i$ are the network weights.
By varying the coefficient $\nu$, the impact of the weight-decay term can be changed and hence
we modify the complexity of the model, which defines the flexibility of the component curves in
nonlinear PCA.
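
The penalised error of the inverse model can be written down compactly. The sketch below assumes a 1-10-3 generation network with tanh hidden units, applies the penalty to the weight matrices only (not the biases), and skips missing entries through a boolean mask; these details and all names are assumptions for illustration, not the reference implementation.

```python
import numpy as np

def generate(W1, b1, W2, b2, z):
    """Generation part Z -> X of a 1-10-3 network (tanh hidden layer assumed)."""
    return np.tanh(z @ W1.T + b1) @ W2.T + b2

def total_error(W1, b1, W2, b2, z, x, nu, observed=None):
    """Reconstruction error over observed entries plus the weight-decay penalty."""
    x_hat = generate(W1, b1, W2, b2, z)
    if observed is None:
        observed = np.ones_like(x, dtype=bool)
    # Missing entries (observed == False, possibly NaN in x) contribute nothing
    resid = np.where(observed, x_hat - x, 0.0)
    E = np.sum(resid ** 2) / x.shape[0]
    # Weight decay on the weight matrices; excluding biases is an assumption
    penalty = nu * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return E + penalty
```

In the inverse model both the weights and the component values `z` would be passed to an optimiser that minimises `total_error`; a larger coefficient shrinks the weights and thereby flattens the component curve.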



4 The missing data validation approach 

Since classical test set validation fails to select the optimal nonlinear PCA model, as
illustrated in Fig. 1, I propose to evaluate the complexity of a model by using the error
in missing data estimation as the criterion for model selection. This requires adapting
nonlinear PCA to missing data, as done in the inverse nonlinear PCA model [27]. The
following model selection procedure can be used to find the optimal weight-decay complexity
parameter $\nu$ of the nonlinear PCA model:

1. Choose a specific complexity parameter $\nu$.

2. Apply inverse nonlinear PCA to a training data set.

3. Validate the nonlinear PCA model by its performance on missing data estimation of
an independent test set in which one or more elements $x_i^n$ of a sample $x^n$ are randomly
rejected. The mean of the squared errors $e_i^n = \|\hat{x}_i^n - x_i^n\|^2$ between the randomly
removed values $x_i^n$ and their estimations $\hat{x}_i^n$ by the nonlinear PCA model is used as the
validation or generalization error.

Applied to a range of different weight-decay complexity parameters $\nu$, the optimal model
complexity $\nu$ is given by the lowest missing value estimation error. To get a more robust
result, for each complexity setting, nonlinear PCA can be repeatedly applied by using 
different weight-initializations of the neural network. The median can then be used for 
validation as shown in the following examples. 
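
The selection procedure above can be sketched as a generic loop. The callables `fit` and `estimate` are hypothetical placeholders for training an inverse nonlinear PCA model and for filling in missing values with it; to keep the sketch runnable, a deliberately trivial stand-in model (shrunken column means, not nonlinear PCA) is used in the usage part.

```python
import numpy as np

def missing_data_error(estimate, model, X_test, rng):
    """Remove one random element per test sample and score its estimate."""
    n, d = X_test.shape
    rows = np.arange(n)
    cols = rng.integers(0, d, size=n)
    X_missing = X_test.copy()
    X_missing[rows, cols] = np.nan          # artificially rejected values
    X_hat = estimate(model, X_missing)
    return np.mean((X_hat[rows, cols] - X_test[rows, cols]) ** 2)

def select_complexity(nus, fit, estimate, X_train, X_test, n_restarts=5, seed=0):
    """Return the complexity parameter with the lowest median missing-data error."""
    rng = np.random.default_rng(seed)
    median_errors = []
    for nu in nus:
        errs = [missing_data_error(estimate, fit(X_train, nu, restart), X_test, rng)
                for restart in range(n_restarts)]
        median_errors.append(float(np.median(errs)))
    return nus[int(np.argmin(median_errors))], median_errors

# Stand-in model for demonstration only: column means shrunk by nu
def fit_shrunk_mean(X, nu, restart):
    return X.mean(axis=0) / (1.0 + nu)

def estimate_from_mean(model, X_missing):
    return np.where(np.isnan(X_missing), model, X_missing)

rng = np.random.default_rng(1)
X_train = 5.0 + rng.normal(size=(50, 3))
X_test = 5.0 + rng.normal(size=(200, 3))
best_nu, errors = select_complexity([0.0, 1.0, 10.0], fit_shrunk_mean,
                                    estimate_from_mean, X_train, X_test)
```

With a real nonlinear PCA model, `fit` would use the restart index to draw a different weight initialisation, so that the median over restarts plays the role described in the text.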



5 Validation examples 

The first example of a nonlinear data set shows that model validation based on missing 
data estimation performance provides a clear optimum of the complexity parameter. 
The second example demonstrates that the proposed validation ensures that nonlinear 
PCA does not describe data in a nonlinear way when the inherent data structure is, in 
fact, linear. 

Figure 3: Helical data set. Data are generated from a one dimensional helical loop
embedded in three dimensions and additive Gaussian noise. The nonlinear component 
is plotted as a red line. Red circles show the projections of the data (blue dots) onto 
the curve. 

5.1 Helix data 

The nonlinear data set consists of data $x = (x_1, x_2, x_3)^T$ that lie on a one-dimensional
manifold, a helical loop, embedded in three dimensions, plus Gaussian noise $\eta$ of standard
deviation $\sigma = 0.4$, as illustrated in Fig. 3. The samples $x$ were generated from a
uniformly distributed factor $t$ over the range $[-0.8, 0.8]$, where $t$ represents the angle:

$$
\begin{aligned}
x_1 &= \sin(\pi t) + \eta \\
x_2 &= \cos(\pi t) + \eta \\
x_3 &= t + \eta
\end{aligned}
$$
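
A data generator following these equations might look as follows. The noise is drawn independently for each coordinate, which is an assumption (the equations use the same symbol for all three noise terms), and both function names are hypothetical.

```python
import numpy as np

def make_helix(n, sigma=0.4, rng=None):
    """Sample n noisy points from the helical loop; returns (X, t)."""
    rng = np.random.default_rng() if rng is None else rng
    t = rng.uniform(-0.8, 0.8, size=n)
    clean = np.column_stack([np.sin(np.pi * t), np.cos(np.pi * t), t])
    # Independent Gaussian noise per coordinate (assumed)
    return clean + rng.normal(0.0, sigma, size=clean.shape), t

def remove_one_value_per_sample(X, rng):
    """Return a copy of X with one randomly chosen entry per row set to NaN."""
    X_missing = X.copy()
    cols = rng.integers(0, X.shape[1], size=X.shape[0])
    X_missing[np.arange(X.shape[0]), cols] = np.nan
    return X_missing

rng = np.random.default_rng(0)
X_train, _ = make_helix(20, rng=rng)        # 20 complete training samples
X_test, _ = make_helix(1000, rng=rng)       # 1,000 test samples
X_test_missing = remove_one_value_per_sample(X_test, rng)
```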

Nonlinear PCA is applied by using a 1-10-3 network architecture optimized in 5,000
iterations by using the conjugate gradient descent algorithm [29].

To evaluate different weight-decay complexity parameters $\nu$, nonlinear PCA is applied
to 20 complete samples generated from the helical loop function and validated by using
a missing data set of 1,000 incomplete samples in which one randomly chosen value of the
three dimensions is rejected per sample. The rejected value can be easily estimated from the
other two dimensions when the nonlinear component has the correct helical curve. For comparison
with standard test set validation, the same 1,000 (complete) samples are used. This is
repeated 100 times for each model complexity with newly generated data each
time. The median of the missing data estimation error over all 100 runs is finally taken to
validate a specific model complexity.

[Plot: validation error versus weight-decay; model complexity runs from low (left) to high (right).]

Figure 4: Model selection. Missing data estimation is compared with standard test
set validation by using the helical data set (Fig. 3). A nonlinear PCA network model of
low complexity which is almost linear (left) results in a high error as expected for both
the training and the test data. Only the missing data approach shows the expected
increase in validation error for over-fitted models (right).

Fig. 4 shows the results of comparing the proposed model selection approach with stan- 
dard test set validation. It turns out that only the missing data approach is able to 
show a clear minimum in the performance curve. Test set validation, by contrast, shows
a small error even for very complex (over-fitted) models. This is contrary to our experience
with supervised learning, where the test error becomes large again when the
model over-fits. Thus, test set validation cannot be used to determine the optimal model
complexity of unsupervised methods. In contrast, the missing value validation approach
shows that the optimal complexity setting of the weight-decay coefficient is in the range
$0.0001 < \nu < 0.01$.

5.2 Linear data 

Nonlinear PCA can also be used to answer the question of whether high-dimensional
observations are driven by a unimodal or a multimodal process, e.g., in atmospheric
science for analysing the El Niño-Southern Oscillation [30]. But applying nonlinear PCA
can be misleading if the model complexity is insufficiently controlled: multimodality can
be incorrectly detected in data that are inherently unimodal, as pointed out by
Christiansen [7]. Fig. 5C&D illustrates that if the model complexity is too high, even linear
data are described by nonlinear components. Therefore, to obtain the right description of
the data, controlling the model complexity is very important. Fig. 5 shows the validation
error curves of the standard test set and the proposed missing data validation for
different model complexities. The median of 500 differently initialized 1-4-2 networks is
plotted. Again, it is shown that standard test set validation fails in validating nonlinear
PCA. With increasing model complexity, classical test set validation shows a decreasing
error, and hence favours over-fitted models. By contrast, the missing value estimation
error shows correctly that the optimum would be a strong penalty which gives a linear
or even a point solution, thereby confirming the absence of nonlinearity in the data.
This is correct because the data consist, in principle, of Gaussian noise centred at the
point (0,0).

Figure 5: Nonlinear PCA is applied to data of a two-dimensional Gaussian distribution.
(C,D) Over-fitted models. (B) A weight-decay of 0.01 forces nonlinear PCA to describe
the data by a linear component. (A) An even stronger penalty of 1.0 forces a single
point solution. Below: Only the missing data approach shows that it would be best to
impose a very strong penalty that forces the network into a linear solution.

While test set validation favours over-fitted models which produce components that
incorrectly show multimodal distributions, missing data validation confirms the unimodal
characteristics of the data. Nonlinear PCA in combination with missing data validation 
can therefore be used to find out whether a high-dimensional data set is generated by a 
unimodal or a multimodal process. 

6 Test set versus missing data approach 

In standard test set validation, the nonlinear PCA model is trained using a training set $X$.
An independent test set $y$ is then used to compute a validation error as $E = \|\hat{y} - y\|^2$,
where $\hat{y}$ is the output of the nonlinear PCA given the test data $y$ as the input. The
test set validation reconstructs the test data from the test data itself. The problem with
this approach is that increasingly complex functions can give approximately $\hat{y} = y$, thus
favouring complex models. While test set validation is a standard approach in supervised 
applications, in unsupervised techniques it suffers from the lack of a known target (e.g., a 
class label). Highly complex nonlinear PCA models, which over-fit the original training
data, are in principle also able to fit test data better than would be possible by the true 
original model. With higher complexity, a model is able to describe a more complicated 
structure in the data space. Even for new test samples, it is more likely to find a short
projecting distance (error) onto a curve which covers the data space almost completely
than onto a curve of moderate complexity (Fig. 1). The problem is that we can project
the data onto any position on the curve. There is no further restriction in pure test set 
validation. In missing data estimation, by contrast, the required position on the curve 
is fixed, given by the remaining available values of the same sample. The artificially 
removed missing value of a test sample gives an exact target which has to be predicted
from the available values of the same test sample. While test set validation predicts the 
test data from the test data itself, the missing data validation predicts removed values 
from the remaining values of the same sample. Thus, we transform the unsupervised 
validation problem into a kind of supervised validation problem. 
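
The difference between the two criteria can be made concrete with a small sketch in which the position on the curve is fixed by the available values only. Here the noise-free helical loop of Section 5.1 stands in for a fitted component, and a grid search over the curve replaces the optimisation of the component value used in inverse nonlinear PCA; both are simplifying assumptions, and the function names are hypothetical.

```python
import numpy as np

def helix(t):
    """Noise-free helical loop, used here as a stand-in for a fitted component."""
    t = np.asarray(t, dtype=float)
    return np.column_stack([np.sin(np.pi * t), np.cos(np.pi * t), t])

def estimate_missing(x, curve):
    """Fill NaN entries of one sample from the curve point closest in the observed dimensions."""
    observed = ~np.isnan(x)
    # Distance is measured on the available values only, which fixes the position on the curve
    d2 = ((curve[:, observed] - x[observed]) ** 2).sum(axis=1)
    nearest = curve[np.argmin(d2)]
    return np.where(observed, x, nearest)

curve = helix(np.linspace(-0.8, 0.8, 1601))

x = helix([0.3])[0]          # a sample lying exactly on the curve
x_missing = x.copy()
x_missing[2] = np.nan        # artificially remove the third value
x_filled = estimate_missing(x_missing, curve)
error = (x_filled[2] - x[2]) ** 2   # squared error against the removed target
```

A curve that merely fills the data space would still offer a nearby point in the two observed dimensions, but the third coordinate of that point would generally not match the removed value, so the error against this explicit target penalises such a curve.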

7 Conclusion 

In this paper, a missing data validation approach to model selection is proposed
for auto-associative neural network based nonlinear PCA. The idea behind
this approach is that the true generalization error in unsupervised methods is given by 
a missing value estimation error and not by the classical test set error. The proposed 
missing value validation approach can therefore be seen as an adaptation of the standard 
test set validation so as to be applicable to unsupervised methods. The absence of a 
target value in unsupervised methods is replaced by using artificially removed missing 
values as expected target values that have to be predicted from the remaining values 
of the same sample. It was shown that standard test set validation clearly fails to
validate nonlinear PCA. In contrast, the proposed missing data validation approach
correctly validated the model complexity.

Availability of Software

A MATLAB® implementation of nonlinear PCA including the inverse model for estimating
missing data is available at:
http://www.NLPCA.org/matlab.html

An example of how to apply the proposed validation approach can be found at:
http://www.NLPCA.org/validation.html